
perf(raft): raise MaxInflightMsgs=1024, MaxSizePerMsg=4MB defaults #593

Open

bootjp wants to merge 11 commits into main from perf/raft-message-size-tuning

Conversation

@bootjp (Owner) commented Apr 23, 2026

Summary

Why

MaxSizePerMsg bump (the main win). Under small-entry KV workloads (Redis-style, median entry ~500 B), a 1 MiB cap per MsgApp forces one MsgApp per ~2000 entries. At 4 MiB, each MsgApp amortises ~4x more entries. Concrete effects:

  • Leader: fewer dispatcher goroutine wake-ups per committed byte -> lower CPU use per applied entry.
  • Follower: fewer recv syscalls; less contention between apply loop and read path.
  • Wire: no change in bytes-on-wire; only framing count.

MaxInflightMsgs bump (insurance). In-rack RTT is ~0.2 ms, so the replication-rate cap MaxInflight * MaxSize / RTT is already well above the NIC at the old defaults. The bump is for burst write phases where entries arrive faster than round-trip completion, and to align with the engine-internal 1024 already used for stepCh / dispatchReportCh sizing.
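The headroom claim is easy to sanity-check with the numbers above (a back-of-the-envelope calculation, not code from this PR):

```go
package main

import "fmt"

func main() {
	// Replication-rate ceiling ≈ MaxInflight * MaxSize / RTT, using the
	// old defaults and the ~0.2 ms in-rack RTT quoted above.
	const (
		maxInflight = 256
		maxSize     = 1 << 20 // 1 MiB
		rttSeconds  = 0.0002  // 0.2 ms
	)
	bytesPerSec := float64(maxInflight*maxSize) / rttSeconds
	// ~1342 GB/s, orders of magnitude above a 10-100 Gbit NIC.
	fmt.Printf("replication-rate cap ≈ %.1f GB/s\n", bytesPerSec/1e9)
}
```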

Worst-case memory bound. Per-peer buffered memory is MaxInflight * MaxSize = 1024 * 4 MiB = 4 GiB worst-case (was 256 * 1 MiB = 256 MiB). In practice MsgApp payloads are far smaller than the cap, so steady-state is a small fraction of this.
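Worked out explicitly (the same arithmetic as the paragraph above):

```go
package main

import "fmt"

func main() {
	// Worst-case per-peer buffered bytes = MaxInflightMsgs * MaxSizePerMsg.
	oldBound := int64(256) * (1 << 20)  // old defaults: 256 * 1 MiB
	newBound := int64(1024) * (4 << 20) // proposed:     1024 * 4 MiB
	fmt.Printf("old: %d MiB, new: %d GiB\n", oldBound>>20, newBound>>30)
	// prints "old: 256 MiB, new: 4 GiB"
}
```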

Env vars

| Name | Type | Default | Min | Semantics |
| --- | --- | --- | --- | --- |
| ELASTICKV_RAFT_MAX_INFLIGHT_MSGS | int | 1024 | 1 | Per-peer in-flight MsgApp cap. |
| ELASTICKV_RAFT_MAX_SIZE_PER_MSG | int (bytes) | 4194304 (4 MiB) | 1024 | Per-MsgApp byte cap. |

Env var wins over caller-supplied OpenConfig value so operators can retune a running deployment without a rebuild. Invalid values (non-numeric, zero/negative, sub-1 KiB size) log a warning and fall back to the compiled-in default. Chosen values are logged at info level at Open.

Before / after

| Knob | Before | After |
| --- | --- | --- |
| raft.Config.MaxInflightMsgs | 256 (from main.go) | 1024 |
| raft.Config.MaxSizePerMsg | 1 MiB | 4 MiB |
| raft.Config.MaxCommittedSizePerReady | 1 MiB | 4 MiB |

Qualitative: fewer dispatch wake-ups per committed byte under small-entry workloads; deeper pipelining during write bursts. No expected change on wire bandwidth or correctness.
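For orientation, these knobs correspond one-to-one to fields on etcd's raft.Config; a fragment with this PR's proposed values (other required fields such as ID, ticks, and Storage are elided, so this is illustrative rather than compilable):

```go
import "go.etcd.io/raft/v3"

cfg := &raft.Config{
	// ... ID, ElectionTick, HeartbeatTick, Storage elided ...
	MaxSizePerMsg:            4 << 20, // per-MsgApp byte cap (4 MiB)
	MaxCommittedSizePerReady: 4 << 20, // tracks MaxSizePerMsg, per the table above
	MaxInflightMsgs:          1024,    // per-peer in-flight MsgApp cap
}
```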

Test plan

  • go test ./internal/raftengine/etcd/... -count=1 passes
  • go test ./... -count=1 passes across all packages
  • golangci-lint run ./internal/raftengine/... clean
  • Unit tests: env-unset -> defaults 1024 / 4 MiB propagate; valid override propagates to raft.Config; invalid values fall back to defaults with a warning
  • Existing TestSnapshotEveryFromEnv_* / TestMaxWALFilesFromEnv_* still pass (pattern mirrored)

Surprising findings

  • defaultMaxInflightMsg inside internal/raftengine/etcd/engine.go had already been bumped to 1024 (for stepCh sizing) and documented as such, but main.go was still passing 256 at the FactoryConfig boundary. The normalizer only fills defaults when the caller passes 0/<=0, so 256 leaked straight through to raft.Config.MaxInflightMsgs. This PR fixes that drift.
  • MaxInflightMsgs / MaxSizePerMsg were not previously overridden anywhere else in the tree.
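The drift mechanism is easiest to see in a fill-only-when-unset normalizer; a sketch of the pattern described above (names assumed from the PR text):

```go
package main

import "fmt"

const defaultMaxInflightMsg = 1024

type OpenConfig struct {
	MaxInflightMsg int
}

// normalizeLimitConfig only fills the default when the caller left the
// field unset (<= 0) -- an explicit caller value like 256 passes straight
// through to raft.Config, which is exactly the drift described above.
func normalizeLimitConfig(cfg *OpenConfig) {
	if cfg.MaxInflightMsg <= 0 {
		cfg.MaxInflightMsg = defaultMaxInflightMsg
	}
}

func main() {
	stale := &OpenConfig{MaxInflightMsg: 256} // what main.go was passing
	normalizeLimitConfig(stale)
	fmt.Println(stale.MaxInflightMsg) // 256: leaks through unchanged

	unset := &OpenConfig{}
	normalizeLimitConfig(unset)
	fmt.Println(unset.MaxInflightMsg) // 1024: default applied
}
```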

Summary by CodeRabbit

  • New Features

    • Runtime tuning via ELASTICKV_RAFT_MAX_INFLIGHT_MSGS and ELASTICKV_RAFT_MAX_SIZE_PER_MSG with clear precedence and safety clamping; invalid values log warnings and fall back to safe defaults.
  • Chores

    • Increased default Raft limits (inflight msgs: 256→512; per-message size: 1MiB→2MiB) and added a minimum inbound channel capacity to improve stability.
  • Tests

    • Added tests covering env override parsing, precedence, clamping, and channel sizing behavior.

Expose etcd/raft's MaxInflightMsgs and MaxSizePerMsg as runtime-tunable
knobs via ELASTICKV_RAFT_MAX_INFLIGHT_MSGS and
ELASTICKV_RAFT_MAX_SIZE_PER_MSG, following the same pattern as
ELASTICKV_RAFT_SNAPSHOT_COUNT (#589) and
ELASTICKV_RAFT_MAX_WAL_FILES (#589). Production defaults bump to 1024
and 4 MiB.

Why:

- MaxSizePerMsg 1 MiB -> 4 MiB: under small-entry KV workloads
  (Redis-style, median entry ~500 B) this drastically lowers the
  MsgApp count per committed byte. Fewer dispatcher wake-ups on the
  leader, fewer recv syscalls and less apply-vs-read contention on
  the follower. This is the real win of the change.
- MaxInflightMsgs 256 -> 1024: insurance for burst write phases where
  entries arrive faster than round-trip completion on wide-bandwidth
  LAN. The in-process defaultMaxInflightMsg was already 1024 (see
  stepCh sizing in PR #560), but main.go was still passing 256 at the
  OpenConfig boundary; align the caller with the engine's internal
  expectation.

Semantics:

- Env var wins over caller-supplied cfg so operators can retune a
  running deployment without a rebuild. Invalid values (non-numeric,
  zero/negative, sub-1KiB size) log a warning and fall back to the
  compiled-in default rather than leak through to raft.Config.
- Chosen values are logged at Open at info level so a misconfigured
  override is visible in the boot log.
- MaxSizePerMsg is parsed as plain integer bytes for consistency
  with the other numeric knobs in this package.

Tests cover: defaults when unset, valid override propagation into
raft.Config, invalid values falling back to defaults.
@coderabbitai (bot) commented Apr 23, 2026

Warning

Rate limit exceeded

@bootjp has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 46 minutes and 53 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 46 minutes and 53 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f1f2e0e-edad-4502-b00f-22f13be1c140

📥 Commits

Reviewing files that changed from the base of the PR and between 8f72769 and 11cb09b.

📒 Files selected for processing (2)
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_test.go
📝 Walkthrough

Retunes etcd/raft inbound buffering: increases compiled defaults (MaxInflightMsg -> 512, MaxSizePerMsg -> 2MiB), exposes MaxSizePerMsg on OpenConfig, adds env-driven overrides (ELASTICKV_RAFT_MAX_INFLIGHT_MSGS, ELASTICKV_RAFT_MAX_SIZE_PER_MSG) with validation and clamping, and uses the resolved inflight limit to size inbound channels with a minimum floor. New tests validate parsing, precedence, and channel sizing.

Changes

  • internal/raftengine/etcd/engine.go (Raft engine runtime/config): increases compiled defaults (MaxInflightMsg=512, MaxSizePerMsg=2MiB); adds exported OpenConfig.MaxSizePerMsg; introduces env parsing (ELASTICKV_RAFT_MAX_INFLIGHT_MSGS, ELASTICKV_RAFT_MAX_SIZE_PER_MSG) with safety caps and warnings; extends normalizeLimitConfig to apply env overrides and clamp programmatic values; Open() allocates stepCh and dispatchReportCh using the resolved inflight limit and enforces the inboundChannelCap floor.
  • main.go (Application bootstrap): updates factory/default constants to the new MaxInflightMsg and MaxSizePerMsg values and adds clarifying comments about the byte/memory envelope and relevant env variables.
  • internal/raftengine/etcd/engine_test.go (Tests): adds tests for env parsing and precedence, validation and clamping behavior, channel capacity flooring, and that Open() uses the normalized/resolved inflight value for channel capacities.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I nibble bytes and count the flow,

Inflight grows, the buffers glow,
Env whispers limits, caps keep pace,
Channels hum in steady space,
A rabbit cheers this tuned embrace 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title references raising MaxInflightMsgs to 1024 and MaxSizePerMsg to 4MB, but the final defaults after reviewer iteration are MaxInflightMsgs=512 and MaxSizePerMsg=2MiB. | Update the title to reflect the final defaults, e.g. 'perf(raft): raise MaxInflightMsgs=512, MaxSizePerMsg=2MiB defaults'. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.00%, above the required 80.00% threshold. |
| Linked Issues check | ✅ Passed | Check skipped: no linked issues found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped: no linked issues found for this pull request. |


@gemini-code-assist (bot) left a comment

Code Review

This pull request increases the default Raft message size limits from 1 MiB to 4 MiB and the maximum inflight messages from 256 to 1024 to optimize performance for small-entry KV workloads. It also introduces environment variable overrides for these parameters to allow operators to tune replication flow control at runtime without a rebuild. Review feedback identifies a logic discrepancy in the environment variable parsing functions: when an invalid value is provided, the code returns a failure signal that causes the system to retain caller-supplied configuration rather than falling back to the compiled-in defaults as indicated by the warning logs. Suggestions were provided to ensure the fallback behavior matches the logged intent.

Comment on lines +2908 to +2913

```go
n, err := strconv.Atoi(v)
if err != nil || n < 1 {
	slog.Warn("invalid ELASTICKV_RAFT_MAX_INFLIGHT_MSGS; using default",
		"value", v, "default", defaultMaxInflightMsg)
	return 0, false
}
```
medium

The warning message indicates that the engine will fall back to the default value (defaultMaxInflightMsg), but the function returns (0, false). This causes normalizeLimitConfig to retain the caller-supplied value from OpenConfig instead of the compiled-in default. If a caller provides a non-default value (e.g., 256) and the environment variable is malformed, the log message will be misleading as it won't actually use the default 1024. To align the implementation with the log message and the PR description, the function should return the default value and true when an invalid override is detected.

Suggested change

```diff
 n, err := strconv.Atoi(v)
 if err != nil || n < 1 {
 	slog.Warn("invalid ELASTICKV_RAFT_MAX_INFLIGHT_MSGS; using default",
 		"value", v, "default", defaultMaxInflightMsg)
-	return 0, false
+	return defaultMaxInflightMsg, true
 }
```
References
  1. Avoid state inconsistencies during normalization by ensuring that invalid inputs are handled explicitly and do not lead to misleading states or logs.

Comment on lines +2928 to +2933

```go
n, err := strconv.ParseUint(v, 10, 64)
if err != nil || n < minMaxSizePerMsg {
	slog.Warn("invalid ELASTICKV_RAFT_MAX_SIZE_PER_MSG; using default",
		"value", v, "min", minMaxSizePerMsg, "default", uint64(defaultMaxSizePerMsg))
	return 0, false
}
```

medium

Similar to the MaxInflightMsgs helper, this function returns (0, false) on invalid input, which prevents normalizeLimitConfig from applying the compiled-in default if the caller provided a different value. This makes the using default warning message misleading. The function should return the default value and true to ensure the fallback behavior matches the logs.

Suggested change

```diff
 n, err := strconv.ParseUint(v, 10, 64)
 if err != nil || n < minMaxSizePerMsg {
 	slog.Warn("invalid ELASTICKV_RAFT_MAX_SIZE_PER_MSG; using default",
 		"value", v, "min", minMaxSizePerMsg, "default", uint64(defaultMaxSizePerMsg))
-	return 0, false
+	return uint64(defaultMaxSizePerMsg), true
 }
```
References
  1. Avoid state inconsistencies during normalization by ensuring that invalid inputs are handled explicitly and do not lead to misleading states or logs.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98554e6efd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2628 to +2629

```go
if v, ok := maxInflightMsgFromEnv(); ok {
	cfg.MaxInflightMsg = v
```

P1 Badge Size inbound queues from overridden MaxInflightMsg

When ELASTICKV_RAFT_MAX_INFLIGHT_MSGS is set, this code allows values above 1024, but Open still allocates stepCh and dispatchReportCh with a fixed capacity of defaultMaxInflightMsg (1024). In clusters that tune inflight to values like 2048 (as the new tests demonstrate), inbound traffic can hit the smaller fixed queue and return errStepQueueFull, effectively dropping/deferring raft messages and negating the intended throughput gain under bursty replication load.


bootjp added 2 commits April 23, 2026 23:13
The maxInflightMsgFromEnv / maxSizePerMsgFromEnv helpers logged "using
default" when the env var was malformed, but returned (0, false) —
telling normalizeLimitConfig to keep whatever the caller supplied. A
caller passing MaxInflightMsg=256 plus a garbage env var would silently
retain 256, making the warning log a lie.

Return (defaultMaxInflightMsg, true) and (defaultMaxSizePerMsg, true)
on invalid input so the default the operator-visible warning promises
is actually the value that reaches raft.Config. Addresses gemini review
feedback on #593.
…ault

Open() allocated stepCh and dispatchReportCh at fixed capacity
defaultMaxInflightMsg (1024). When operators raised
ELASTICKV_RAFT_MAX_INFLIGHT_MSGS (e.g. to 2048) the raft layer honoured
the higher inflight limit, but the inbound channels stayed at 1024, so
bursty replication still tripped errStepQueueFull at 1024 queued
messages — silently negating the tuning knob the PR was meant to
expose.

Size the two channels from prepared.cfg.MaxInflightMsg (already
post-normalizeLimitConfig, so env override and compiled default have
been applied). A new inboundChannelCap() helper clamps to
minInboundChannelCap=256 to preserve a survivable floor if a caller
passes a tiny value. Add TestOpen_InboundChannelsHonourMaxInflightEnv
(env=2048 -> cap=2048) and TestOpen_InboundChannelsDefaultCap
(unset -> cap=1024) plus a unit test for the helper. Addresses codex
P1 review feedback on #593.
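The floor described in that commit amounts to a one-line clamp; a sketch (the helper and constant names come from the commit message, the exact implementation is assumed):

```go
package main

import "fmt"

const minInboundChannelCap = 256

// inboundChannelCap sizes stepCh / dispatchReportCh from the resolved
// MaxInflightMsg, clamped to a survivable floor so a tiny caller value
// cannot starve the inbound path.
func inboundChannelCap(maxInflight int) int {
	if maxInflight < minInboundChannelCap {
		return minInboundChannelCap
	}
	return maxInflight
}

func main() {
	fmt.Println(inboundChannelCap(128))  // 256 (floor applied)
	fmt.Println(inboundChannelCap(2048)) // 2048 (passthrough)
}
```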
@bootjp (Owner, Author) commented Apr 23, 2026

Addressed the two review findings from gemini and codex on this PR.

Commits

  • bb6ff01 fix(raft): apply MaxInflightMsg default on invalid env override — addresses gemini medium x2. maxInflightMsgFromEnv / maxSizePerMsgFromEnv now return (defaultMaxInflightMsg, true) and (defaultMaxSizePerMsg, true) on invalid input instead of (0, false), so the "using default" warning log is no longer a lie when the caller supplied a non-default value.
  • 0546484 fix(raft): size stepCh from resolved MaxInflightMsg, not compiled default — addresses codex P1. Open() previously hard-coded stepCh and dispatchReportCh capacity at defaultMaxInflightMsg (1024). With ELASTICKV_RAFT_MAX_INFLIGHT_MSGS=2048, the raft layer honoured 2048 but the inbound channels stayed at 1024 and tripped errStepQueueFull, silently defeating the knob. Channels are now sized from prepared.cfg.MaxInflightMsg (post-normalizeLimitConfig) via a new inboundChannelCap() helper that clamps to minInboundChannelCap=256 as a survivable floor.

Channel capacity before/after

| Env var | Before (stepCh / dispatchReportCh cap) | After |
| --- | --- | --- |
| unset | 1024 | 1024 |
| ELASTICKV_RAFT_MAX_INFLIGHT_MSGS=2048 | 1024 (bug) | 2048 |
| ELASTICKV_RAFT_MAX_INFLIGHT_MSGS=512 | 1024 | 512 |
| ELASTICKV_RAFT_MAX_INFLIGHT_MSGS=128 | 1024 | 256 (floor) |

Tests added / updated

  • TestMaxInflightMsgFromEnv_FallsBackOnInvalid / TestMaxSizePerMsgFromEnv_FallsBackOnInvalid — updated to pin the new (default, true) contract.
  • TestNormalizeLimitConfig_InvalidEnvOverridesCaller — new; invalid env + caller-supplied value resolves to the compiled-in default.
  • TestInboundChannelCap — new; floor / passthrough unit test.
  • TestOpen_InboundChannelsHonourMaxInflightEnv — new; opens a real engine with env=2048 and asserts cap(engine.stepCh) == 2048 and cap(engine.dispatchReportCh) == 2048.
  • TestOpen_InboundChannelsDefaultCap — new; env unset -> cap = defaultMaxInflightMsg.

go test -race ./internal/raftengine/... green. golangci-lint run ./internal/raftengine/... reports 0 issues.

/gemini review
@codex review

@gemini-code-assist (bot) left a comment

Code Review

This pull request increases the default Raft message size from 1 MiB to 4 MiB and the inflight message limit from 256 to 1024 to optimize performance for small-entry workloads. It introduces environment variable overrides (ELASTICKV_RAFT_MAX_INFLIGHT_MSGS and ELASTICKV_RAFT_MAX_SIZE_PER_MSG) and ensures that internal channel capacities are dynamically sized to match these limits. Review feedback highlighted the increased worst-case memory footprint, which could lead to OOM issues in some environments, and noted a documentation inconsistency where the code falls back to defaults instead of clamping to a floor value.

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +49 to +50

```go
// Note that with the current defaultMaxSizePerMsg of 4 MiB, the
// true worst-case bound can be much larger (up to roughly 4 GiB
```

medium

The increase in defaultMaxInflightMsg (1024) and defaultMaxSizePerMsg (4 MiB) significantly raises the worst-case memory footprint. As noted in the comments, this can reach ~4 GiB per peer. In a cluster with multiple peers, the aggregate memory buffered in stepCh and the per-peer outbound replication lanes could lead to OOM under congestion if entries are large. While this improves performance for small-entry workloads, consider if these defaults are too aggressive for general deployments, or if a more conservative MaxSizePerMsg (e.g., 1-2 MiB) would be safer while still providing significant batching benefits.

References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +84 to +85

```go
// of the knob; clamp to this floor rather than rejecting so that a
// fat-fingered operator doesn't take out the engine.
```

medium

The comment mentions "clamping" to the floor, but the implementation in maxSizePerMsgFromEnv (line 2961) falls back to the defaultMaxSizePerMsg (4 MiB) if the value is below the floor (1 KiB). Clamping usually implies using the floor value itself. Given the large gap between the floor and the default, this behavior might be unexpected for an operator trying to tune for low memory.

Suggested change

```diff
-// of the knob; clamp to this floor rather than rejecting so that a
-// fat-fingered operator doesn't take out the engine.
+// of the knob; fall back to the default rather than rejecting so that a
+// fat-fingered operator doesn't take out the engine.
```

@chatgpt-codex-connector (bot) commented

Codex Review: Didn't find any major issues. Hooray!


bootjp added 2 commits April 23, 2026 23:46
The comment on minMaxSizePerMsg said the override clamps to the floor,
but maxSizePerMsgFromEnv actually falls back to defaultMaxSizePerMsg when
the env value is below the floor (or unparseable). Clamp usually implies
using the floor itself, so the doc misled readers about the behaviour.

Rewrite the sentence to match the implementation: fall back to the
default rather than rejecting, so a fat-fingered operator doesn't take
out the engine. No code change.
…se memory at 2 GiB/peer

Gemini reviewer flagged that 1024 inflight x 4 MiB per message is a
4 GiB per-peer worst-case buffered footprint. With N peers and bursty
replication traffic, the aggregate can plausibly reach OOM territory on
general-purpose deployments where the operator has not tuned the knob
to the workload.

In practice TCP backpressure on a typical LAN will cap the actually
buffered bytes well below the product, but the reviewer's framing is
reasonable: a compiled-in default should be conservative.

Take option B from the gemini response matrix: keep MaxSizePerMsg at
4 MiB (the MsgApp-batching win that motivates this PR's narrative on
small-entry KV workloads) and drop MaxInflightMsgs back to 512. This
halves the worst-case product to 2 GiB per peer while preserving the
batching improvement. 512 is still a 2x safety margin over the pre-#529
default of 256 that we observed filling up on followers under the
pebble-seek-storm incident (PRs #560/#562/#563/#565 have since removed
most of that CPU contention, so the deeper pipelining of 1024 was
motivated more by belt-and-suspenders than measured need).

Operators who need deeper pipelines (wide-bandwidth LAN, plenty of RAM)
can still raise the knob via ELASTICKV_RAFT_MAX_INFLIGHT_MSGS without
a rebuild; operators who need a tighter memory budget can lower
MaxSizePerMsg via ELASTICKV_RAFT_MAX_SIZE_PER_MSG.

Update the in-code rationale comment, the caller-side constant in
main.go (so the engine default is actually exercised when no env
override is set), and the TestNormalizeLimitConfig_DefaultsWhenUnset
pin from 1024 -> 512. Drop the stale "(1024)" parenthetical in the
TestOpen_InboundChannelsHonourMaxInflightEnv docblock.
@bootjp (Owner, Author) commented Apr 23, 2026

Addressed both gemini medium items.

Item 1 — memory-footprint concern (option B)

Commit a248131a — lowered defaultMaxInflightMsg from 1024 back to 512, kept defaultMaxSizePerMsg at 4 MiB. Worst-case per-peer buffered product drops from 4 GiB to 2 GiB.

Why B over A: the existing code comment explicitly frames the 4 MiB bump as a batching win (amortise more entries per MsgApp, cut dispatcher wake-ups / recv syscalls on small-entry KV). Dropping MaxSizePerMsg to 2 MiB would undo that narrative. The 1024 bump was framed as a 4x belt-and-suspenders margin over the 256-slot bottleneck seen during the pebble-seek-storm incident, which PRs #560/#562/#563/#565 have since largely removed. 512 still gives a 2x safety margin over the pre-#529 default of 256 while preserving the MsgApp batching improvement. Operators who want deeper pipelining on wide-bandwidth LAN can still raise via ELASTICKV_RAFT_MAX_INFLIGHT_MSGS without a rebuild.

Also updated main.go's etcdMaxInflightMsg caller-side constant so the engine default is actually exercised, and pinned the test expectation from 1024 to 512.

Item 2 — clamping doc fix

Commit ca6a4151 — the minMaxSizePerMsg comment now says "fall back to the default rather than rejecting" instead of "clamp to this floor", matching the implementation in maxSizePerMsgFromEnv which returns defaultMaxSizePerMsg on sub-floor / malformed input.

Verification

  • go test -race ./internal/raftengine/... clean (17.4s)
  • golangci-lint run ./internal/raftengine/... clean

/gemini review

@codex review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a248131afe


Comment thread internal/raftengine/etcd/engine.go Outdated
```diff
 // memory is bounded by O(numPeers * MaxInflightMsg * avgMsgSize).
-// Default: 256. Increase for deeper pipelining on high-bandwidth links;
+// Default: 1024. Increase for deeper pipelining on high-bandwidth links;
 // lower in memory-constrained clusters.
```

P3 Badge Update documented MaxInflightMsg default to match code

The OpenConfig.MaxInflightMsg docstring says the default is 1024, but this commit sets defaultMaxInflightMsg to 512 and tests pin that value. This mismatch can mislead operators and library users who rely on GoDoc/comments for tuning decisions, especially when diagnosing queue pressure or memory footprint. Please update the comment to reflect the actual default.


@gemini-code-assist (bot) left a comment

Code Review

This pull request updates the Raft engine's default flow-control parameters, increasing the maximum message size to 4 MiB and reducing the maximum in-flight messages to 512. It introduces environment variable overrides for these settings to allow runtime tuning and ensures that internal channel capacities are dynamically sized based on these limits. Review feedback identifies a documentation mismatch regarding the new default in-flight message count and suggests refining the logging logic to avoid redundancy in multi-shard deployments.

Comment thread internal/raftengine/etcd/engine.go Outdated
```diff
 // memory is bounded by O(numPeers * MaxInflightMsg * avgMsgSize).
-// Default: 256. Increase for deeper pipelining on high-bandwidth links;
+// Default: 1024. Increase for deeper pipelining on high-bandwidth links;
 // lower in memory-constrained clusters.
```

medium

The comment indicates a default value of 1024, but the actual compiled-in default defaultMaxInflightMsg has been set to 512 in this PR (line 60). This mismatch can mislead callers who rely on the documentation to understand the behavior when the field is left at its zero value.

Suggested change

```diff
-// Default: 1024. Increase for deeper pipelining on high-bandwidth links;
+// Default: 512. Increase for deeper pipelining on high-bandwidth links;
```

Comment on lines +2665 to +2668
```go
slog.Info("etcd raft engine: message size limits",
	"max_inflight_msgs", cfg.MaxInflightMsg,
	"max_size_per_msg_bytes", cfg.MaxSizePerMsg,
)
```

medium

Logging the message size limits at Info level inside normalizeLimitConfig will produce redundant log entries for every shard opened (e.g., 100+ lines of identical output in a typical multi-shard deployment). Consider logging this once per process or only when the values deviate from the compiled-in defaults to reduce log noise.

@bootjp (Owner, Author) commented Apr 23, 2026

Lowered defaultMaxSizePerMsg from 4 MiB to 2 MiB (commit 4edd02d) to address the OOM concern raised on this PR.

Worst-case memory math

With MaxInflightMsgs=512 (from the prior commit):

| Defaults | Per-peer | 5-node leader (4 followers) |
| --- | --- | --- |
| old (4 MiB) | 512 × 4 MiB = 2 GiB | 8 GiB |
| new (2 MiB) | 512 × 2 MiB = 1 GiB | 4 GiB |

This fits comfortably inside the 4–16 GiB RAM envelope of typical elastickv nodes, while still doubling etcd/raft's 1 MiB upstream default.

Batching impact is marginal

On the small-entry KV workload this PR targets (median ~500 B), 2 MiB / 500 B ≈ 4000 entries per MsgApp — well past the point where per-RPC amortisation saturates. 4 MiB bought essentially nothing on top of that.
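The amortisation claim checks out numerically (illustrative arithmetic, not project code):

```go
package main

import "fmt"

func main() {
	const medianEntry = 500 // bytes, per the workload profile above
	for _, capBytes := range []int{1 << 20, 2 << 20, 4 << 20} {
		// Entries batched into a single MsgApp at each size cap.
		fmt.Printf("%d MiB cap -> ~%d entries per MsgApp\n",
			capBytes>>20, capBytes/medianEntry)
	}
	// 2 MiB already batches ~4194 entries; 4 MiB (~8388) adds little
	// on top of an already-saturated per-RPC amortisation.
}
```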

Changes

  • internal/raftengine/etcd/engine.go: defaultMaxSizePerMsg = 2 << 20, rationale block updated to reflect 512 × 2 MiB = 1 GiB worst case.
  • main.go: etcdMaxSizePerMsg = 2 << 20 aligned with the engine default, block comment updated.
  • internal/raftengine/etcd/engine_test.go: TestNormalizeLimitConfig_DefaultsWhenUnset pinned to 2 MiB.

Note that MaxCommittedSizePerReady tracks MaxSizePerMsg (they share the same field in the raft.Config wiring at engine.go L613-614), so no separate change is needed.

Verification

  • go test -race ./internal/raftengine/... — pass
  • golangci-lint run ./internal/raftengine/... — 0 issues

Per the user's explicit confirmation that MaxInflightMsgs=512, MaxSizePerMsg=2 MiB is the intended operating point.

/gemini review
@codex review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adjusts Raft flow control parameters by setting defaultMaxInflightMsg to 512 and defaultMaxSizePerMsg to 2 MiB to optimize memory usage within typical node envelopes. It introduces environment variable overrides for these settings, allowing runtime tuning without rebuilds, and ensures that internal channel capacities are sized according to the resolved limits. Review feedback correctly identifies discrepancies in the documentation regarding these new default values and suggests enhancing log messages with the local engine ID for better observability in sharded environments.

Comment thread internal/raftengine/etcd/engine.go Outdated
HeartbeatTick int
StateMachine StateMachine
// MaxSizePerMsg caps the byte size of a single MsgApp payload (Raft-level
// flow control). Default: 4 MiB. Larger values amortise more entries per
Contributor


medium

The docstring for MaxSizePerMsg states a default of 4 MiB, but the constant defaultMaxSizePerMsg is defined as 2 MiB (line 88). Please update the documentation to match the implementation. Maintaining an accurate and strictly enforced maximum size limit is essential when pre-allocating buffers for deserialization.

Suggested change
// flow control). Default: 4 MiB. Larger values amortise more entries per
// flow control). Default: 2 MiB. Larger values amortise more entries per
References
  1. Pre-allocating a buffer is acceptable if a strictly enforced maximum size limit is checked before allocation and the entire data must be materialized in memory.

Comment thread internal/raftengine/etcd/engine.go Outdated
// memory is bounded by O(numPeers * MaxInflightMsg * avgMsgSize).
// Default: 256. Increase for deeper pipelining on high-bandwidth links;
// lower in memory-constrained clusters.
// Default: 1024. Increase for deeper pipelining on high-bandwidth links;
Contributor


medium

The docstring for MaxInflightMsg states a default of 1024, but the constant defaultMaxInflightMsg is set to 512 (line 64). This should be updated to avoid confusing callers. Applying fixed bounds to collections that grow from external requests is necessary to prevent unbounded memory growth and OOM issues.

Suggested change
// Default: 1024. Increase for deeper pipelining on high-bandwidth links;
// Default: 512. Increase for deeper pipelining on high-bandwidth links;
References
  1. To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests.

Comment on lines +2679 to +2682
slog.Info("etcd raft engine: message size limits",
"max_inflight_msgs", cfg.MaxInflightMsg,
"max_size_per_msg_bytes", cfg.MaxSizePerMsg,
)
Contributor


medium

In sharded environments where multiple Raft engines may be active, it is helpful to include the local_id in this log message to identify which engine's limits are being reported.

Suggested change
slog.Info("etcd raft engine: message size limits",
"max_inflight_msgs", cfg.MaxInflightMsg,
"max_size_per_msg_bytes", cfg.MaxSizePerMsg,
)
slog.Info("etcd raft engine: message size limits",
"local_id", cfg.LocalID,
"max_inflight_msgs", cfg.MaxInflightMsg,
"max_size_per_msg_bytes", cfg.MaxSizePerMsg,
)


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4edd02d5c0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/raftengine/etcd/engine.go Outdated
// memory is bounded by O(numPeers * MaxInflightMsg * avgMsgSize).
// Default: 256. Increase for deeper pipelining on high-bandwidth links;
// lower in memory-constrained clusters.
// Default: 1024. Increase for deeper pipelining on high-bandwidth links;

P3 Badge Correct stale defaults in OpenConfig docs

The exported OpenConfig field comments now advertise MaxSizePerMsg defaulting to 4 MiB and MaxInflightMsg defaulting to 1024, but normalizeLimitConfig and the package constants set the real defaults to 2 MiB and 512. This mismatch can mislead operators and integrators doing memory sizing or troubleshooting, because they will believe the runtime is using a different operating point than it actually is.


@bootjp
Owner Author

bootjp commented Apr 23, 2026

Addressed codex P3: OpenConfig doc comments in internal/raftengine/etcd/engine.go now reflect the real defaults (MaxSizePerMsg=2 MiB, MaxInflightMsg=512) instead of the stale 4 MiB / 1024 values.

Files corrected:

  • internal/raftengine/etcd/engine.go (OpenConfig MaxSizePerMsg and MaxInflightMsg field comments)

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/raftengine/etcd/engine_test.go`:
- Around line 1990-2024: The tests call Open() which normalizes both raft env
vars and can pick up an ambient ELASTICKV_RAFT_MAX_SIZE_PER_MSG; ensure test
isolation by clearing that size env var before calling Open() in both
TestOpen_InboundChannelsHonourMaxInflightEnv and
TestOpen_InboundChannelsDefaultCap—add t.Setenv(maxSizePerMsgEnvVar, "") (in
addition to the existing t.Setenv(maxInflightMsgEnvVar, ...)) immediately before
creating the engine so Open() sees an empty/cleared max size env var.

In `@internal/raftengine/etcd/engine.go`:
- Around line 493-501: Update the outdated numeric value in the explanatory
comment referencing channel sizing: instead of saying the channel would be
silently capped at 1024, update the wording to reflect the current
defaultMaxInflightMsg value (512) and ensure the comment still explains that
using defaultMaxInflightMsg would cap the channel at that default even when the
Raft layer (MaxInflightMsg via normalizeLimitConfig /
ELASTICKV_RAFT_MAX_INFLIGHT_MSGS) requests a larger buffer, potentially
re-triggering errStepQueueFull under bursty conditions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 99d7c0ca-4f3e-48e8-a666-69a3894c3c45

📥 Commits

Reviewing files that changed from the base of the PR and between 306eb0b and b5c5298.

📒 Files selected for processing (3)
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_test.go
  • main.go

Comment thread internal/raftengine/etcd/engine_test.go
Comment thread internal/raftengine/etcd/engine.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5c5298a21


Comment on lines +2954 to +2956
n, err := strconv.Atoi(v)
if err != nil || n < 1 {
slog.Warn("invalid ELASTICKV_RAFT_MAX_INFLIGHT_MSGS; using default",

P2 Badge Cap MaxInflight env override to avoid startup OOM

maxInflightMsgFromEnv accepts any positive integer, but Open uses the resolved value to allocate stepCh, dispatchReportCh, and per-peer dispatch queues, so a fat-fingered value like ELASTICKV_RAFT_MAX_INFLIGHT_MSGS=100000000 can trigger massive channel allocations and crash the process before the node becomes healthy. Since this knob is now operator-facing at runtime, it needs a sane upper bound (or rejection) rather than only checking n < 1.


Comment on lines +2976 to +2980
n, err := strconv.ParseUint(v, 10, 64)
if err != nil || n < minMaxSizePerMsg {
slog.Warn("invalid ELASTICKV_RAFT_MAX_SIZE_PER_MSG; using default",
"value", v, "min", minMaxSizePerMsg, "default", uint64(defaultMaxSizePerMsg))
return uint64(defaultMaxSizePerMsg), true

P2 Badge Enforce transport-safe upper bound for MaxSizePerMsg env

maxSizePerMsgFromEnv currently accepts any value >= 1 KiB, but the transport path in this repo is bounded by gRPC message-size limits (internal.GRPCMaxMessageBytes = 64 MiB via server/dial options), so setting ELASTICKV_RAFT_MAX_SIZE_PER_MSG above that can make Raft emit MsgApp payloads the transport cannot carry, causing repeated send failures/unreachable reports under large batches. The env parser should reject or clamp values above the transport budget.


Codex P2s on the env-tuning knobs: maxInflightMsgFromEnv accepted any
positive integer and maxSizePerMsgFromEnv accepted anything >= 1 KiB,
so an operator typo (e.g. ELASTICKV_RAFT_MAX_INFLIGHT_MSGS=100000000)
could either allocate multi-GB of channel memory at Open() and crash
the process, or force MsgApp payloads larger than the gRPC transport
budget (internal.GRPCMaxMessageBytes = 64 MiB). Both are now clamped
to the compiled-in default with a warning.

- maxMaxInflightMsg (8192) gates the inflight env override. stepCh,
  dispatchReportCh, and every per-peer dispatch queue size from this
  value, so the upper bound is load-bearing for startup memory.
- maxMaxSizePerMsg (64 MiB) equals GRPCMaxMessageBytes and is pinned
  by a new test so a transport-budget change that does not update the
  Raft cap fails CI instead of silently allowing unsendable payloads.
- Clamp boundary tests (at cap and cap+1) catch off-by-one regressions.

Also addresses CodeRabbit minors flagged on the PR:
- Clear ELASTICKV_RAFT_MAX_SIZE_PER_MSG in TestOpen_InboundChannels*
  to avoid inheriting ambient shell env from the operator running the
  suite locally.
- Replace the stale "1024" inflight-cap comment in Open() (now 512).

The upper-bound fallback matches the existing lower-bound fallback
policy: return (default, true) so normalizeLimitConfig actually applies
the compiled-in default the warning log promises, rather than letting
a caller-supplied value silently win.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed review feedback in d60cdcf:

Codex P2 (x2) — env-override safety:

  • maxInflightMsgFromEnv now clamps values above maxMaxInflightMsg (8192) back to the compiled-in default with a warning. Prevents a fat-fingered ELASTICKV_RAFT_MAX_INFLIGHT_MSGS=1e8 from allocating multi-GB of channel memory at Open() and crashing startup.
  • maxSizePerMsgFromEnv now clamps values above maxMaxSizePerMsg (64 MiB = internal.GRPCMaxMessageBytes) back to the default. Prevents Raft from emitting MsgApp payloads the gRPC transport physically cannot carry. TestMaxMaxSizePerMsg_MatchesTransportBudget pins the cap to the transport budget so a transport-budget change without a paired Raft-cap update fails CI.

CodeRabbit minors (x2):

  • TestOpen_InboundChannels* now explicitly clears ELASTICKV_RAFT_MAX_SIZE_PER_MSG alongside the inflight var so an ambient shell env cannot influence the channel-cap assertions.
  • Replaced the stale "silently cap at 1024" comment on the inboundChannelCap call with the current value (512, the current defaultMaxInflightMsg).

Boundary tests (at cap and cap+1) added for both overrides to catch off-by-one regressions.

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d60cdcffb2


Comment on lines +3022 to +3023
if n > maxMaxSizePerMsg {
slog.Warn("ELASTICKV_RAFT_MAX_SIZE_PER_MSG exceeds transport budget; using default",

P2 Badge Reserve gRPC overhead when validating MaxSizePerMsg

The n > maxMaxSizePerMsg check accepts ELASTICKV_RAFT_MAX_SIZE_PER_MSG=64<<20, but etcd/raft's MaxSizePerMsg bounds the encoded entries payload, not the full serialized raftpb.Message envelope. In configurations that push batches near this boundary, the resulting MsgApp can exceed gRPC's 64 MiB transport limit and get rejected (ResourceExhausted), causing repeated replication send failures even though the value passed validation. This limit should leave explicit headroom (or validate against actual marshaled message size) instead of allowing the transport ceiling exactly.


Codex P2 follow-up on d60cdcf: etcd-raft MaxSizePerMsg caps the
entries-data size per MsgApp, not the full serialized raftpb.Message
envelope. The envelope adds Term/Index/From/To/LogTerm/Commit plus
per-entry framing, so a batch that exactly hits MaxSizePerMsg
serializes to a frame slightly larger than MaxSizePerMsg. With the
previous cap (maxMaxSizePerMsg == GRPCMaxMessageBytes == 64 MiB), a
full-sized batch could tip the frame a few KiB past the transport
limit and replication would fail with ResourceExhausted even though
the env-var value passed our validation.

Clamp the override at GRPCMaxMessageBytes - raftMessageEnvelopeHeadroom
(1 MiB) so the accepted set is strictly smaller than the transport
budget. 1 MiB is an order of magnitude over the observed wire
overhead for 8192-entry batches (tens of KiB), leaving the transport
budget as the single source of truth while making the env override a
safe bound under real batching.

Renamed the pinning test to TestMaxMaxSizePerMsg_ReservesEnvelopeHeadroom
and tightened it to assert strict-less-than GRPCMaxMessageBytes.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Addressed Codex P2 follow-up in 639ef08:

Reserve gRPC envelope headroom on MaxSizePerMsg. Right — etcd/raft MaxSizePerMsg caps the entries-data size per MsgApp, not the full raftpb.Message envelope. The envelope adds Term/Index/From/To/LogTerm/Commit plus per-entry framing, so a batch that exactly hits MaxSizePerMsg serializes to a frame slightly larger than MaxSizePerMsg. With the prior cap (maxMaxSizePerMsg == GRPCMaxMessageBytes exactly), a full-sized batch could tip the serialized frame past the 64 MiB transport limit and fail replication with ResourceExhausted even though env validation had accepted the value.

Now clamped at GRPCMaxMessageBytes - raftMessageEnvelopeHeadroom (1 MiB), so accepted overrides are strictly smaller than the transport budget. 1 MiB is an order of magnitude over the observed wire overhead for 8192-entry batches (tens of KiB), keeping the transport budget as the single source of truth while making the env override a safe bound under real batching.

The pinning test (renamed TestMaxMaxSizePerMsg_ReservesEnvelopeHeadroom) now asserts strict-less-than GRPCMaxMessageBytes so a future change that collapses the headroom trips CI.

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 639ef081b4


Comment thread internal/raftengine/etcd/engine.go Outdated
// is added. Clamping to (budget - headroom) makes the override a
// *safe* bound under real batching. Values above this clamp back
// to the default with a warning.
maxMaxSizePerMsg = uint64(internalutil.GRPCMaxMessageBytes) - raftMessageEnvelopeHeadroom

P2 Badge Derive MaxSizePerMsg cap from worst-case protobuf overhead

The maxMaxSizePerMsg calculation assumes a fixed 1 MiB envelope margin, but raftpb.Message overhead grows with entry count as well as payload bytes. With tiny-entry batches, a value that passes maxSizePerMsgFromEnv can still serialize above the 64 MiB gRPC limit (e.g., 1-byte entries at a 63 MiB entry budget add many MiB of per-entry framing), so replication can still hit ResourceExhausted despite validation succeeding. This cap needs to be based on worst-case encoded MsgApp size (or a substantially larger bound tied to entry count), not a fixed subtraction.


Comment on lines 2704 to 2705
if cfg.MaxInflightMsg <= 0 {
cfg.MaxInflightMsg = defaultMaxInflightMsg

P2 Badge Clamp caller MaxInflightMsg before channel allocation

normalizeLimitConfig only defaults cfg.MaxInflightMsg when it is non-positive, but it never applies the new upper safety bound to caller-supplied values. Since Open now allocates stepCh and dispatchReportCh from prepared.cfg.MaxInflightMsg, a fat-fingered programmatic config (for example 100000000) can trigger huge channel allocations and crash startup even though env overrides are capped. Apply maxMaxInflightMsg (or equivalent) to direct config input before these allocations.


…tConfig

Codex P2 follow-up on 639ef08: env overrides go through the
maxMaxInflightMsg / maxMaxSizePerMsg upper bounds, but caller-supplied
OpenConfig values (programmatic configuration via the Go API) skipped
the clamp entirely. With the prior code a fat-fingered
OpenConfig{MaxInflightMsg: 100_000_000} would have bypassed the env
guard and triggered multi-GB stepCh / dispatchReportCh allocations at
Open() — defeating the whole point of the upper bounds.

normalizeLimitConfig now applies the same clamp policy to caller
values as to env values: exceeding the upper bound logs a warning and
falls back to the compiled-in default. The env override step still
runs first (operator wins), so a healthy operator-tuned env value is
not affected.

Three new tests cover:
- TestNormalizeLimitConfig_ClampsCallerOverlimitInflight: 100M is
  refused, default is applied.
- TestNormalizeLimitConfig_ClampsCallerOverlimitSize: symmetric.
- TestNormalizeLimitConfig_AcceptsCallerAtCap: boundary value (exactly
  maxMax*) passes through unchanged.

Re the other Codex P2 ("derive cap from worst-case protobuf overhead"):
declined silently. etcd-raft limitSize already enforces the cap
inclusive of per-entry framing (it sums ent.Size() which is the
encoded protobuf size, not just payload bytes). The MsgApp envelope
adds only the outer Message fields (Term/Index/From/To/etc.) which
are bounded by ~hundreds of bytes, well within the 1 MiB headroom
already reserved. The "tiny entries add many MiB of framing" claim
double-counts the per-entry overhead etcd-raft already includes.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 3 in 8f72769:

Codex P2 — clamp caller-supplied OpenConfig values. Right, the env-side guards in maxInflightMsgFromEnv / maxSizePerMsgFromEnv only catch values coming through the env vars; a programmatic caller passing OpenConfig{MaxInflightMsg: 100_000_000} skipped the clamp entirely and would have triggered multi-GB channel allocations at Open(). normalizeLimitConfig now applies maxMaxInflightMsg / maxMaxSizePerMsg to the resolved values too: exceeding either upper bound falls back to the compiled-in default with a warning. Env overrides still run first, so legitimate operator tuning is unaffected. Boundary tests (caller value at cap and cap+1) pin the new behaviour.

Codex P2 — declined silently: "derive cap from worst-case protobuf overhead". Re-checked against the etcd-raft source. limitSize already enforces MaxSizePerMsg using ent.Size() — the encoded protobuf size of each entry, including Type / Term / Index / Data-length framing — not just payload bytes. The MsgApp envelope above MaxSizePerMsg only adds the outer Message fields (Type/To/From/Term/LogTerm/Index/Commit/Reject/RejectHint, plus protobuf wrapping), which is bounded by hundreds of bytes — well within the 1 MiB headroom this PR already reserves. The "tiny entries add many MiB of framing" claim double-counts the per-entry overhead etcd-raft already includes in its limit, so no further adjustment is warranted.

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f72769e01


Comment thread internal/raftengine/etcd/engine.go Outdated
// is added. Clamping to (budget - headroom) makes the override a
// *safe* bound under real batching. Values above this clamp back
// to the default with a warning.
maxMaxSizePerMsg = uint64(internalutil.GRPCMaxMessageBytes) - raftMessageEnvelopeHeadroom

P2 Badge Derive MaxSizePerMsg cap from worst-case envelope growth

maxMaxSizePerMsg is currently computed as GRPCMaxMessageBytes - 1MiB, but that fixed headroom is not sufficient when entries are very small. etcd-raft’s batching limit uses Entry.Size() per entry, while protobuf encoding of raftpb.Message also adds per-entry repeated-field framing (tag + len-varint) on top; this overhead grows with entry count, so a config value accepted at this cap can still serialize beyond gRPC’s 64 MiB limit and fail replication with ResourceExhausted under tiny-entry workloads. The cap should be based on a worst-case per-entry envelope cost (or enforced against encoded Message.Size()), not a constant subtraction.



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
internal/raftengine/etcd/engine_test.go (1)

1848-1911: Optional: consolidate the three-case-per-helper tests into one table-driven test.

The TestMaxInflightMsgFromEnv_Unset/ReadsOverride/FallsBackOnInvalid trio (and its MaxSizePerMsg twin) could be folded into a single table-driven test per helper with fields {name, envVal, wantOK, wantN}. Same structure, fewer boilerplate t.Setenv(...); helper(); require.Equal(...) blocks, and table-driven is the preferred style per the repo guidelines. Not required — current form is readable and each test documents a distinct contract. As per coding guidelines, "Unit tests must be co-located with packages in *_test.go files; prefer table-driven test cases".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/raftengine/etcd/engine_test.go` around lines 1848 - 1911, Replace
each trio of tests
TestMaxInflightMsgFromEnv_Unset/ReadsOverride/FallsBackOnInvalid with a single
table-driven test that iterates cases with fields {name, envVal, wantOK, wantN},
sets the env var via t.Setenv(maxInflightMsgEnvVar, envVal), calls
maxInflightMsgFromEnv(), and asserts ok==wantOK and n==wantN using require; do
the same replacement for the MaxSizePerMsg trio using maxSizePerMsgEnvVar and
maxSizePerMsgFromEnv() (use uint64 for wantN and reference
defaultMaxInflightMsg/defaultMaxSizePerMsg and minMaxSizePerMsg in the
appropriate cases).


📥 Commits

Reviewing files that changed from the base of the PR and between b5c5298 and 8f72769.

📒 Files selected for processing (2)
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_test.go

Codex P2 (re-flagged): the prior cap (GRPCMaxMessageBytes - 1 MiB)
was based on a static envelope-overhead estimate, but etcd-raft's
limitSize counts only Entry.Size() per entry. The serialized
raftpb.Message that crosses the wire ALSO carries per-entry framing
inside its Entries repeated field — `1 byte (field tag) + sovRaft(len)
+ Entry payload` per entry — which limitSize does not account for.

For minimal entries (Term + Index + Type tags+varints = ~4 B,
empty Data), per-entry framing is 1 + sovRaft(4) = 2 B. So
tiny-entry workloads can encode at up to (4+2)/4 = 1.5x the entries-
data budget on the wire. With the prior 1 MiB headroom (a fixed
subtraction), a config of MaxSizePerMsg = GRPCMaxMessageBytes - 1 MiB
serializing tiny entries could still ResourceExhaust gRPC's 64 MiB
transport limit.

Replaced the fixed headroom with a divisor: maxMaxSizePerMsg =
GRPCMaxMessageBytes / 2. This admits the worst-case 1.5x growth
(32 MiB entries → 48 MiB on wire ≪ 64 MiB transport) and is the
minimum cap that keeps the env override safe under any entry-size
distribution.

Re-derived TestMaxMaxSizePerMsg_ReservesEnvelopeHeadroom against
the new divisor; added TestMaxMaxSizePerMsg_AbsorbsTinyEntryFraming
as a quantitative pin so a future change that raises the divisor
below 2 (which would re-introduce the tiny-entry failure mode)
fails CI.

The earlier "1 MiB headroom" reasoning was incomplete — corrected
here.
@bootjp
Owner Author

bootjp commented Apr 24, 2026

Round 4 in 11cb09b — Codex was right, my prior justification was wrong.

Corrected: halve MaxSizePerMsg cap to absorb tiny-entry framing. Re-checked the etcd-raft source carefully: limitSize counts only Entry.Size() per entry, but raftpb.Message.Size() also adds per-entry framing inside the Entries repeated field — 1 byte (field tag) + sovRaft(len) + Entry payload. For a minimal Entry (Term + Index + Type tags+varints, no Data, ~4 B), the framing is 2 B. So tiny-entry workloads encode at up to (4+2)/4 = 1.5x the entries-data budget on the wire. With the prior 1 MiB fixed headroom, a config of MaxSizePerMsg = GRPCMaxMessageBytes - 1 MiB on a tiny-entry workload could still serialize past the 64 MiB transport limit and fail with ResourceExhausted — exactly the failure mode the previous Codex P2 flagged.

Replaced the fixed-subtraction headroom with a divisor: maxMaxSizePerMsg = GRPCMaxMessageBytes / 2 = 32 MiB. This admits the worst-case 1.5x growth (32 MiB entries → 48 MiB on wire, well below 64 MiB transport) and is the minimum cap that keeps the env override safe under any entry-size distribution.

TestMaxMaxSizePerMsg_AbsorbsTinyEntryFraming added as a quantitative pin against the worst-case ratio so a future change that raises the divisor below 2 fails CI. Apologies for the prior decline-with-justification — the 1 MiB headroom analysis was incomplete because I had double-counted Entry.Size() with the per-Entry Message-level framing.
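The worst-case amplification and the divisor-based cap can be reproduced with plain varint arithmetic; `sov` below reimplements the varint-length helper gogo/protobuf generates as `sovRaft`, and the 4 B minimal-entry size is the assumption from the comment above:

```go
package main

import "fmt"

// sov returns the protobuf varint encoding length of x, matching the
// sovRaft helper gogo/protobuf generates for raftpb.
func sov(x uint64) int {
	n := 1
	for x >= 0x80 {
		x >>= 7
		n++
	}
	return n
}

func main() {
	const entrySize = 4           // minimal Entry: Term+Index+Type tags+varints, no Data
	framing := 1 + sov(entrySize) // repeated-field tag + length varint per entry

	// limitSize budgets entrySize per entry; the wire carries entrySize+framing.
	ratio := float64(entrySize+framing) / float64(entrySize)
	fmt.Printf("worst-case wire amplification: %.1fx\n", ratio) // 1.5x

	const grpcMax = 64 << 20
	maxSizeCap := grpcMax / 2 // the divisor-based cap from this commit
	worstWire := maxSizeCap + maxSizeCap/entrySize*framing
	fmt.Printf("cap=%d MiB -> worst-case wire=%d MiB (budget %d MiB)\n",
		maxSizeCap>>20, worstWire>>20, grpcMax>>20)
	// cap=32 MiB -> worst-case wire=48 MiB (budget 64 MiB)
}
```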

/gemini review
@codex review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

bootjp added a commit that referenced this pull request Apr 25, 2026
…s=4 (#636)

## Summary

Aligns the S3 PutObject Raft entry size with `MaxSizePerMsg` (4 MiB as of PR #593).

**Change:** `s3ChunkBatchOps = 16 → 4`, so that one Raft entry = `s3ChunkBatchOps ×
s3ChunkSize ≒ 4 MiB`. One-line diff.

## Why

`etcd/raft`'s `util.go:limitSize` has a documented exception: a single entry that exceeds
`MaxSizePerMsg` is not rejected; it is shipped alone in its own MsgApp (`"if the size of the
first entry exceeds maxSize, a non-empty slice with just this entry is returned"`). A 16 MiB
entry passes straight through on this path, so:

| Item | s3ChunkBatchOps=16 | s3ChunkBatchOps=4 (this PR) |
|---|---|---|
| Size of one Raft entry | ~16 MiB | ~4 MiB |
| Leader worst-case / peer | 1024 × 16 MiB = **16 GiB** | 1024 × 4 MiB =
**4 GiB** |
| 3-node cluster (leader) | 32 GiB | 8 GiB |
| WAL fsync per entry | 16 MiB | 4 MiB |
| Raft commits for a 5 GiB PUT | 320 | 1280 (4×) |

The `1024 × 4 MiB = 4 GiB / peer` memory bound stated in the PR #593 description assumes
small entries and did not hold on the S3 path. With this PR, S3 traffic rides the ordinary
batched-MsgApp path, so the bound is accurate with S3 included.

The Raft commit count rises 4×, but each fsync is 4× smaller, so combined with the WAL group
commit from PR #600, end-to-end throughput should be roughly unchanged.

## Test plan

- [x] `go build ./...` clean
- [x] `golangci-lint run ./...` 0 issues
- [x] `go test ./adapter/ -short -run 'TestS3|S3Server'` pass
- [ ] CI

## Follow-ups (design docs in separate PRs)

1. **S3 admission control**: put a hard cap on the total bytes of concurrently in-flight S3
PUT bodies so leader-side memory stays within `4 GiB × peers` even as client parallelism grows.
2. **Raft snapshot strategy / blob bypass**: carrying a 5 GiB blob in a snapshot transfer when
a follower falls behind is undesirable, so move blobs off the Raft path and replicate only
manifests through Raft.
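The commit counts in the table follow from the batch arithmetic, assuming `s3ChunkSize` is 1 MiB (implied by 16 chunks ≈ 16 MiB; the actual constant lives in the adapter code):

```go
package main

import "fmt"

func main() {
	const (
		s3ChunkSize = 1 << 20 // assumption: 1 MiB per chunk
		putBytes    = 5 << 30 // the 5 GiB PUT from the table
	)
	for _, batchOps := range []int{16, 4} {
		entry := batchOps * s3ChunkSize
		fmt.Printf("s3ChunkBatchOps=%2d: entry=%2d MiB, Raft commits for 5 GiB PUT=%d\n",
			batchOps, entry>>20, putBytes/entry)
	}
	// s3ChunkBatchOps=16: entry=16 MiB, Raft commits for 5 GiB PUT=320
	// s3ChunkBatchOps= 4: entry= 4 MiB, Raft commits for 5 GiB PUT=1280
}
```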

/gemini review
@codex review
